DIAC+: a Professional Diacritics Recovering System
Authors
Abstract
In languages that use diacritical characters, stripping these marks from a word often yields a string that does not exist in the language, so its normative form is generally easy to recover. This is not always the case, however: when both the form with the diacritic and the form without it are valid words, the presence or absence of the mark on a base letter may change the word's grammatical properties or even its meaning. Recovering the missing diacritics then becomes a difficult task, not only for a program but sometimes even for a human reader. We describe and evaluate an accurate knowledge-based system for automatically recovering the missing diacritics in MS Office documents written in Romanian. In the rare cases where the system cannot make a reliable decision, it either presents the user with a list of words and their recovery suggestions, or probabilistically chooses one of the possible changes and leaves a trace (a highlighted comment) on each word whose modification was uncertain.
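The behaviour described in the abstract can be illustrated with a minimal sketch. It is not the DIAC+ implementation: the lexicon, its frequency counts, and the names `LEXICON`, `INDEX`, `strip_diacritics`, and `restore_word` are all hypothetical, and a simple frequency lookup stands in for the knowledge-based and probabilistic components of the actual system. The sketch only shows the two cases: a stripped form with a single valid restoration, and an ambiguous form that is restored to its most likely candidate and flagged as uncertain together with its suggestion list.

```python
import unicodedata
from collections import defaultdict

# Hypothetical toy lexicon: Romanian word form -> made-up frequency count.
LEXICON = {
    "mașină": 50,   # unambiguous once diacritics are stripped ("masina")
    "fata": 80,     # "fata", "fată" and "față" all exist,
    "fată": 95,     # so the stripped form "fata" is ambiguous
    "față": 110,
    "și": 900,
    "si": 5,
}


def strip_diacritics(word: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Index from the stripped form to every lexicon entry that maps to it.
INDEX = defaultdict(list)
for form in LEXICON:
    INDEX[strip_diacritics(form)].append(form)


def restore_word(word: str):
    """Return (restored_word, uncertain, suggestions) for a lowercase word.

    Unknown words are returned unchanged. Ambiguous words get the most
    frequent candidate, are marked uncertain, and carry all candidates
    as suggestions, mimicking the "leave a trace" behaviour.
    """
    candidates = INDEX.get(word, [])
    if not candidates:
        return word, False, []
    best = max(candidates, key=LEXICON.get)
    if len(candidates) == 1:
        return best, False, []
    return best, True, sorted(candidates)


if __name__ == "__main__":
    for token in ["masina", "fata", "xyz"]:
        print(token, "->", restore_word(token))
```

In a real system the choice among candidates would depend on context (for example, the neighbouring words or their part-of-speech tags) rather than on isolated word frequency; the uncertainty flag is what would drive the highlighted comment in the output document.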
Similar resources
Diacritics Restoration in Romanian Texts
There are several languages that use diacritical characters outside the ASCII charset. For some of the languages, most diacritical characters can be deterministically recovered but in general, this is not the prevailing case. However, the difficulty of the task differs from language to language depending on the functional role of the diacritical characters. For Romanian, automatic restoration o...
Automatic diacritization of Arabic transcripts for automatic speech recognition
Arabic orthography does not provide full vocalization of the text, and the reader is expected to infer short vowels from the context of the sentence. Inferring the full form of a word is useful when developing Arabic speech and language processing tools, since it is likely to reduce ambiguity in these tasks. In this paper, we present generative techniques for recovering vowels and other diacrit...
Reconstruction of Polish diacritics in a text-to-speech system
This paper describes an approach to reconstruction of the Polish diacritic signs, needed e.g. in a speech synthesis system. Some telecommunication services (for example SMS transmission in GSM) remove diacritics from the text. Without them the text is usually still understandable to a reader, but if a TTS system reads it, the speech becomes heavily distorted. In this paper we propose to use neu...
Instant Diacritics Restoration System for Sindhi Accent Prediction using N-Gram and Memory-Based Learning Approaches
The script of the Sindhi language is highly complex, owing in part to an abundance of homographic words. Interpreting the text becomes difficult because a homographic word can carry many meanings unless its pronunciation is specified with diacritics. Diacritics help readers comprehend the text easily. Due to the rapidly developi...
A robust diacritics restoration system using unreliable raw text data
Statistical language models are utilized in many speech processing algorithms, e.g., automatic speech recognition (ASR). Such a model is created from a text corpus, but many of the text corpora for Romanian are unreliable with respect to the use of diacritic marks, i.e., diacritics are either partially or completely missing, resulting in low quality language models. We present a methodology for...